26 research outputs found

    Near-optimal bounds for phase synchronization

    The problem of phase synchronization is to estimate the phases (angles) of a complex unit-modulus vector $z$ from their noisy pairwise relative measurements $C = zz^* + \sigma W$, where $W$ is a complex-valued Gaussian random matrix. The maximum likelihood estimator (MLE) is a solution to a unit-modulus constrained quadratic programming problem, which is nonconvex. Existing works have proposed polynomial-time algorithms such as a semidefinite relaxation (SDP) approach or the generalized power method (GPM) to solve it. Numerical experiments suggest that both of these methods succeed with high probability for $\sigma$ up to $\tilde{\mathcal{O}}(n^{1/2})$, yet existing analyses only confirm this observation for $\sigma$ up to $\mathcal{O}(n^{1/4})$. In this paper, we bridge the gap by proving that the SDP is tight for $\sigma = \mathcal{O}(\sqrt{n/\log n})$, and that GPM converges to the global optimum in the same regime. Moreover, we establish a linear convergence rate for GPM and derive a tighter $\ell_\infty$ bound for the MLE. A novel technique we develop in this paper is to track (theoretically) $n$ closely related sequences of iterates, in addition to the sequence of iterates GPM actually produces. As a by-product, we obtain an $\ell_\infty$ perturbation bound for leading eigenvectors. Our result also confirms intuitions that use techniques from statistical mechanics. Comment: 34 pages, 1 figure
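
    The abstract describes the generalized power method only at a high level. The numpy sketch below shows the iteration in its usual textbook form: spectral initialization followed by repeated entrywise projection of $Cz$ onto the unit circle. The initialization, step count, and problem sizes are illustrative assumptions, not necessarily the exact variant analyzed in the paper.

```python
import numpy as np

def generalized_power_method(C, n_iter=100):
    """Sketch of GPM for phase synchronization.

    C : (n, n) complex Hermitian measurement matrix, C = z z^* + sigma * W.
    Returns a unit-modulus estimate of z (up to a global phase).
    Assumes the iterates never have an exactly-zero entry.
    """
    # Spectral initialization: leading eigenvector of C, projected
    # entrywise onto the unit circle.
    _, eigvecs = np.linalg.eigh(C)
    z = eigvecs[:, -1]
    z = z / np.abs(z)
    for _ in range(n_iter):
        w = C @ z              # power step (some variants add a multiple of I to C)
        z = w / np.abs(w)      # entrywise projection onto |z_j| = 1
    return z

# Example: synthetic instance with n = 200 and sigma = 1.0.
n, sigma = 200, 1.0
z_true = np.exp(1j * 2 * np.pi * np.random.rand(n))
G = (np.random.randn(n, n) + 1j * np.random.randn(n, n)) / np.sqrt(2)
W = (G + G.conj().T) / np.sqrt(2)          # Hermitian Gaussian noise
C = np.outer(z_true, z_true.conj()) + sigma * W
z_hat = generalized_power_method(C)
# Correlation |<z_hat, z_true>| / n close to 1 indicates successful recovery.
print(np.abs(np.vdot(z_hat, z_true)) / n)
```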

    The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training

    Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not induce overfitting. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic feature vectors in $d$ dimensions and $N$ hidden neurons. Under the assumption $N \le Cd$ (for $C$ a constant), we show that the network can exactly interpolate the data as soon as the number of parameters is significantly larger than the number of samples: $Nd \gg n$. Under these assumptions, we show that the empirical NT kernel has minimum eigenvalue bounded away from zero, and we characterize the generalization error of min-$\ell_2$-norm interpolants when the target function is linear. In particular, we show that the network approximately performs ridge regression in the raw features, with a strictly positive 'self-induced' regularization. Comment: 69 pages, 4 figures
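
    As a rough illustration of the NT regime discussed above, the sketch below builds the empirical first-layer NT features of a random two-layer ReLU network and fits the min-$\ell_2$-norm interpolant to a linear target. The architecture, dimensions, and data model are simplified assumptions for illustration; they do not reproduce the paper's precise setting or its generalization analysis.

```python
import numpy as np

def nt_features(X, W, a):
    """First-layer neural tangent features of a two-layer ReLU network.

    X : (n, d) inputs, W : (N, d) hidden weights, a : (N,) output weights.
    Returns Phi : (n, N*d) with rows [a_j * relu'(<w_j, x>) * x]_{j=1..N}.
    """
    pre = X @ W.T                       # (n, N) preactivations
    act = (pre > 0).astype(X.dtype)     # ReLU derivative
    # Builds a_j * sigma'(<w_j, x_i>) * x_i for every pair (i, j).
    Phi = np.einsum('ij,j,ik->ijk', act, a, X).reshape(X.shape[0], -1)
    return Phi / np.sqrt(W.shape[0])

# Toy instance (dimensions are illustrative, chosen so that N*d >> n).
rng = np.random.default_rng(0)
n, d, N = 200, 30, 100
X = rng.standard_normal((n, d)) / np.sqrt(d)
beta = rng.standard_normal(d)
y = X @ beta                            # linear target, as in the paper's analysis
W = rng.standard_normal((N, d))
a = rng.choice([-1.0, 1.0], size=N)

Phi = nt_features(X, W, a)
# Min-l2-norm interpolant in NT feature space (pseudoinverse solution);
# training error should be near zero when Phi has full row rank.
theta = np.linalg.pinv(Phi) @ y
print("training error:", np.linalg.norm(Phi @ theta - y))
```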

    Differentially Private Data Releasing for Smooth Queries with Synthetic Database Output

    We consider accurately answering smooth queries while preserving differential privacy. A query is said to be $K$-smooth if it is specified by a function defined on $[-1,1]^d$ whose partial derivatives up to order $K$ are all bounded. We develop an $\epsilon$-differentially private mechanism for the class of $K$-smooth queries. The major advantage of the algorithm is that it outputs a synthetic database, which is appealing in real applications. Our mechanism achieves an accuracy of $O(n^{-\frac{K}{2d+K}}/\epsilon)$ and runs in polynomial time. We also generalize the mechanism to preserve $(\epsilon, \delta)$-differential privacy with slightly improved accuracy. Extensive experiments on benchmark datasets demonstrate that the mechanisms have good accuracy and are efficient.
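
    The paper's synthetic-database mechanism is not spelled out in the abstract, so the snippet below only illustrates the underlying $\epsilon$-differential-privacy guarantee using the standard Laplace mechanism on a single smooth query over $[-1,1]^d$. It is a generic baseline, not the mechanism proposed in the paper, and the query, sensitivity bound, and parameters are assumptions for illustration.

```python
import numpy as np

def laplace_mechanism(db, query, eps, sensitivity):
    """Generic eps-DP Laplace mechanism for a single bounded query.

    `sensitivity` is the query's global sensitivity: the maximum change of
    the query value when a single database row is replaced.
    """
    true_answer = query(db)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / eps)
    return true_answer + noise

# Example: average of a smooth function f on [-1, 1]^d over the database rows.
# Since |f| <= 1, changing one of n rows moves the average by at most 2/n.
d, n, eps = 3, 1000, 0.5
db = np.random.uniform(-1, 1, size=(n, d))
f = lambda x: np.prod(np.cos(x), axis=-1)          # a smooth query on [-1,1]^d
query = lambda rows: f(rows).mean()
print(laplace_mechanism(db, query, eps, sensitivity=2.0 / n))
```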

    Unraveling Projection Heads in Contrastive Learning: Insights from Expansion and Shrinkage

    We investigate the role of projection heads, also known as projectors, within the encoder-projector framework (e.g., SimCLR) used in contrastive learning. We aim to demystify the observed phenomenon where representations learned before the projector outperform those learned after it, as measured by downstream linear classification accuracy, even when the projectors themselves are linear. In this paper, we make two significant contributions towards this aim. First, through empirical and theoretical analysis, we identify two crucial effects, expansion and shrinkage, induced by the contrastive loss on the projectors. In essence, the contrastive loss either expands or shrinks the signal direction in the representations learned by an encoder, depending on factors such as the augmentation strength and the temperature used in the contrastive loss. Second, drawing inspiration from the expansion and shrinkage phenomenon, we propose a family of linear transformations that accurately model the projector's behavior. This enables us to precisely characterize the downstream linear classification accuracy in the high-dimensional asymptotic limit. Our findings reveal that linear projectors operating in the shrinkage (or expansion) regime hinder (or improve) the downstream classification accuracy. This provides the first theoretical explanation of why (linear) projectors impact the downstream performance of learned representations. Our theoretical findings are further corroborated by extensive experiments on both synthetic data and real image data.
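
    To make the expansion/shrinkage effect concrete, the toy sketch below applies a linear "projector" that rescales a single signal coordinate and measures downstream accuracy with a ridge-regularized linear probe on held-out data. The one-dimensional signal model, the ridge probe, and all constants are illustrative assumptions; the paper's analysis works in a high-dimensional asymptotic limit that this toy example does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, lam = 500, 300, 2000, 500.0

def sample(n):
    """Representations with a single signal direction (coordinate 0)."""
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.standard_normal((n, d))
    X[:, 0] += 1.5 * y
    return X, y

def linear_projector(X, alpha):
    """Scale the signal coordinate by alpha: shrinkage (alpha < 1) or expansion (alpha > 1)."""
    Z = X.copy()
    Z[:, 0] *= alpha
    return Z

def ridge_probe_accuracy(Xtr, ytr, Xte, yte, alpha):
    """Fit a ridge linear probe on projected training data, evaluate on test data."""
    Ztr, Zte = linear_projector(Xtr, alpha), linear_projector(Xte, alpha)
    w = np.linalg.solve(Ztr.T @ Ztr + lam * np.eye(d), Ztr.T @ ytr)
    return np.mean(np.sign(Zte @ w) == yte)

Xtr, ytr = sample(n_train)
Xte, yte = sample(n_test)
for alpha in [0.1, 1.0, 10.0]:
    print(f"alpha = {alpha:5.1f}  downstream accuracy = "
          f"{ridge_probe_accuracy(Xtr, ytr, Xte, yte, alpha):.3f}")
```

    In this toy setup, shrinking the signal coordinate (alpha well below 1) degrades the probe's accuracy toward chance, while keeping or expanding it preserves or improves accuracy, mirroring the qualitative claim in the abstract.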

    Tractability from overparametrization: The example of the negative perceptron

    In the negative perceptron problem we are given $n$ data points $({\boldsymbol x}_i, y_i)$, where ${\boldsymbol x}_i$ is a $d$-dimensional vector and $y_i \in \{+1,-1\}$ is a binary label. The data are not linearly separable and hence we content ourselves with finding a linear classifier with the largest possible \emph{negative} margin. In other words, we want to find a unit-norm vector ${\boldsymbol \theta}$ that maximizes $\min_{i\le n} y_i \langle {\boldsymbol \theta}, {\boldsymbol x}_i \rangle$. This is a non-convex optimization problem (it is equivalent to finding a maximum-norm vector in a polytope), and we study its typical properties under two random models for the data. We consider the proportional asymptotics in which $n, d \to \infty$ with $n/d \to \delta$, and prove upper and lower bounds on the maximum margin $\kappa_{\text{s}}(\delta)$ or, equivalently, on its inverse function $\delta_{\text{s}}(\kappa)$. In other words, $\delta_{\text{s}}(\kappa)$ is the overparametrization threshold: for $n/d \le \delta_{\text{s}}(\kappa) - \varepsilon$ a classifier achieving vanishing training error exists with high probability, while for $n/d \ge \delta_{\text{s}}(\kappa) + \varepsilon$ it does not. Our bounds on $\delta_{\text{s}}(\kappa)$ match to the leading order as $\kappa \to -\infty$. We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold $\delta_{\text{lin}}(\kappa)$. We observe a gap between the interpolation threshold $\delta_{\text{s}}(\kappa)$ and the linear programming threshold $\delta_{\text{lin}}(\kappa)$, raising the question of the behavior of other algorithms. Comment: 88 pages; 7 pdf figures
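
    The abstract reformulates the negative-margin problem as finding a maximum-norm vector in a polytope. The sketch below follows that reformulation with a simple linear surrogate objective solved by scipy's linprog; the surrogate direction and normalization are illustrative choices on our part, not necessarily the LP-based algorithm whose threshold $\delta_{\text{lin}}(\kappa)$ the paper characterizes.

```python
import numpy as np
from scipy.optimize import linprog

def lp_negative_margin(X, y, u):
    """LP sketch for the negative-perceptron problem.

    Maximizes the linear surrogate <u, theta> over the polytope
    {theta : y_i <x_i, theta> >= -1} (bounded w.h.p. when n/d > 2),
    then reports the margin of the normalized solution.
    """
    n, d = X.shape
    A_ub = -(y[:, None] * X)                 # rows encode -y_i <x_i, theta> <= 1
    b_ub = np.ones(n)
    res = linprog(-u, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * d, method="highs")
    theta = res.x / np.linalg.norm(res.x)
    kappa = np.min((y[:, None] * X) @ theta)  # achieved (negative) margin
    return theta, kappa

# Toy instance in the proportional regime n/d = delta = 3 (not linearly separable).
rng = np.random.default_rng(0)
d, delta = 100, 3.0
n = int(delta * d)
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)
u = X.T @ y                                   # a simple surrogate direction
theta_hat, kappa_hat = lp_negative_margin(X, y, u)
print("achieved margin:", kappa_hat)          # negative, since the data are not separable
```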